where k denotes the full-precision kernels, w the reconstructed matrix, v the variance of y, μ the mean of the kernels, Ψ the covariance of the kernels, f_m the features of class m, and c the mean of f_m.
Zheng et al. [288] define a new quantization loss between the binary weights and the learned real values and theoretically prove the necessity of minimizing this weight quantization loss. Ding et al. [56] propose a distribution loss to explicitly regularize the activation flow and develop a framework to formulate the loss systematically. Empirical results show that the proposed distribution loss is robust to the choice of training hyper-parameters. All of these methods aim to minimize the quantization error and information loss, thereby improving the compactness and capacity of 1-bit CNNs.
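As a rough illustration of this idea, the sketch below adds a weight quantization penalty to the task loss in PyTorch; the scaled-sign binarizer, the per-filter scaling, and the weighting factor lam are illustrative assumptions, not the exact formulations of [288] or [56].

```python
import torch
import torch.nn as nn

def binarize(weight: torch.Tensor) -> torch.Tensor:
    # Scaled-sign binarization with one scaling factor per output filter
    # (an assumption; other binarizers can be plugged in here).
    alpha = weight.abs().mean(dim=(1, 2, 3), keepdim=True)
    return alpha * weight.sign()

def quantization_loss(weight: torch.Tensor) -> torch.Tensor:
    # Gap between the real-valued weights and their binary proxies.
    return (weight - binarize(weight)).pow(2).mean()

def total_loss(task_loss: torch.Tensor, model: nn.Module, lam: float = 1e-4) -> torch.Tensor:
    # Sum the quantization penalty over all conv layers and add it to the task loss.
    q_loss = sum(quantization_loss(m.weight)
                 for m in model.modules() if isinstance(m, nn.Conv2d))
    return task_loss + lam * q_loss
```

Minimizing such a penalty jointly with the task loss keeps the latent real-valued weights close to values that binarize with little error.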
1.1.6
Neural Architecture Search
Neural architecture search (NAS) has attracted significant attention owing to its remarkable performance in various deep learning tasks. Impressive results have been shown, for example, with reinforcement learning (RL) based search [306]. Recent methods such as differentiable architecture search (DARTS) [151] reduce the search time by formulating the task in a differentiable manner. To reduce redundancy in the network space, partially connected DARTS (PC-DARTS) was recently introduced to perform a more efficient search without compromising the performance of DARTS [265].
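To make the differentiable formulation concrete, the following sketch shows a DARTS-style mixed operation in PyTorch: each edge computes a softmax-weighted sum of candidate operations, and the architecture parameters alpha are learned by gradient descent. The candidate set here is illustrative and much smaller than the search spaces used in [151] or [265].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    # Weighted sum of candidate operations; the architecture parameters
    # (alpha) are learned by gradient descent alongside the network weights.

    def __init__(self, channels: int):
        super().__init__()
        # Illustrative candidate set; real DARTS search spaces are larger.
        self.ops = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.Conv2d(channels, channels, 5, padding=2, bias=False),
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Identity(),
        ])
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = F.softmax(self.alpha, dim=0)  # relax the discrete choice
        return sum(w * op(x) for w, op in zip(weights, self.ops))
```

After search, the operation with the largest alpha on each edge is kept, yielding a discrete architecture.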
In Binarized Neural Architecture Search (BNAS) [35], neural architecture search is used to search for BNNs, and the BNNs obtained by BNAS can outperform conventional models by a large margin. Another natural approach is to use 1-bit CNNs to reduce the
computation and memory cost of NAS by taking advantage of the strengths of each in a
unified framework [304]. To accomplish this, a Child-Parent (CP) model is introduced to a
differentiable NAS to search the binarized architecture (Child) under the supervision of a
full-precision model (Parent). In the search stage, an indicator computed from the accuracies of the Child and Parent models is used to evaluate the performance of candidate operations and abandon those with less potential. In the training stage, a kernel-level CP loss
is introduced to optimize the binarized network. Extensive experiments demonstrate that
the proposed CP-NAS achieves accuracy comparable to that of traditional NAS on both the
CIFAR and ImageNet databases.
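The following sketch illustrates, under stated assumptions, how Parent supervision of this kind could look in code: a kernel-level loss pulls the Child's kernels toward the full-precision Parent's, and a simple accuracy-based indicator ranks operations. Both functions are hypothetical simplifications rather than the exact CP loss and indicator of [304].

```python
import torch
import torch.nn.functional as F

def cp_kernel_loss(child_kernels, parent_kernels):
    # Kernel-level supervision: pull the (binarized) Child kernels toward the
    # corresponding full-precision Parent kernels, layer by layer.
    loss = torch.zeros(())
    for w_child, w_parent in zip(child_kernels, parent_kernels):
        loss = loss + F.mse_loss(w_child, w_parent.detach())
    return loss

def cp_indicator(child_acc: float, parent_acc: float) -> float:
    # Hypothetical performance indicator built from the accuracies of the
    # Child and Parent models, used to discard less promising operations.
    return child_acc / max(parent_acc, 1e-8)
```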
Unlike conventional convolutions, BNAS is achieved by transforming all convolutions in
the search space O into binarized convolutions. The full-precision and binarized kernels are denoted as X and X̂, respectively, and a convolution operation in O is represented as B_j = B_i ⊗ X̂, where ⊗ denotes convolution. To build BNAS, a key step is to binarize the kernels from X to X̂, which can be implemented based on state-of-the-art BNNs such as XNOR or PCNN. To reduce the prohibitive search cost, they further introduce channel sampling and a reduction of the operation space into differentiable NAS, significantly cutting the required GPU hours and leading to an efficient
BNAS.
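A minimal sketch of such a binarized convolution, assuming an XNOR-style scaled-sign binarizer and a straight-through estimator, is given below; it is one plausible way to realize B_j = B_i ⊗ X̂ inside the search space, not the exact implementation of [35].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BinarizedConv2d(nn.Conv2d):
    # Drop-in replacement for a convolution in the search space O:
    # the real-valued kernel X is binarized to X_hat before the convolution.

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # XNOR-style binarization: sign of the kernel times a per-filter scale.
        alpha = self.weight.abs().mean(dim=(1, 2, 3), keepdim=True)
        w_hat = alpha * self.weight.sign()
        # Straight-through estimator so gradients flow to the real-valued kernel.
        w_hat = self.weight + (w_hat - self.weight).detach()
        return F.conv2d(x, w_hat, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)
```

Replacing every candidate convolution in O with such a module binarizes the whole search space while leaving the NAS machinery unchanged.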
1.1.7
Optimization
Researchers also explore new training methods to improve BNN performance. These meth-
ods are designed to handle the drawbacks of BNNs. Some borrow popular techniques from
other fields and integrate them into BNNs, while others modify classical BNN training, for example by improving the optimizer.
Sari et al. [234] find that the BatchNorm layer plays a significant role in avoiding explod-
ing gradients, so the standard initialization methods developed for full-precision networks
are irrelevant for BNNs. They also break down BatchNorm components into centering and